
    Memory Migration on Next-Touch

    NUMA capabilities such as explicit migration of memory buffers enable flexible placement of data buffers at runtime, near the tasks that actually access them. The move_pages system call may be invoked manually, but it achieves limited throughput and requires strong collaboration from the application: the location of threads and their memory access patterns must be known precisely in order to decide when to migrate the right memory buffer at the right time. We present the implementation of a Next-Touch memory placement policy that enables automatic dynamic migration of pages when they are actually accessed by a task. We introduce a new PTE flag set up by madvise, and the corresponding Copy-on-Touch codepath in the page-fault handler, which allocates the new page near the accessing task. We then examine the performance and overheads of this model and compare it to using the move_pages system call.
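
    As an illustration of the manual approach the paper compares against, here is a minimal sketch of migrating a buffer with the move_pages system call; the page count and target node are arbitrary placeholders.

        /* Minimal sketch of manual page migration with move_pages(2).
         * Link with -lnuma; node 1 and the page count are placeholders. */
        #include <numaif.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main(void)
        {
            long page_size = sysconf(_SC_PAGESIZE);
            size_t count = 16;                  /* pages to migrate */
            char *buf = aligned_alloc(page_size, count * page_size);

            void *pages[16];
            int nodes[16], status[16];
            for (size_t i = 0; i < count; i++) {
                buf[i * page_size] = 0;         /* touch so the page exists */
                pages[i] = buf + i * page_size;
                nodes[i] = 1;                   /* target NUMA node */
            }

            /* pid 0 means the calling process. */
            if (move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE) != 0)
                perror("move_pages");
            return 0;
        }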

    Enabling High-Performance Memory Migration for Multithreaded Applications on Linux

    As the number of cores per machine increases, memory architectures are being redesigned to avoid bus contention and sustain higher throughput needs. The emergence of Non-Uniform Memory Access (NUMA) constraints has made affinities between threads and buffers an important decision criterion for schedulers. Memory migration lets work and data be redistributed jointly and dynamically across the machine, but it requires high-performance data transfers as well as a convenient programming interface. We present improvements to the Linux migration primitives and the implementation of a Next-Touch policy in the kernel to provide multithreaded applications with an easy way to dynamically maintain thread-data affinity. Microbenchmarks show that our work enables high-performance synchronous and lazy memory migration within multithreaded applications. A threaded LU factorization then reveals the large improvement that our Next-Touch policy may bring to applications with complex access patterns.
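
    A Next-Touch policy is typically exposed to the application as a single hint. The sketch below assumes a hypothetical MADV_NEXTTOUCH advice value (the name is invented for illustration and is not a mainline Linux constant): after the hint, each page migrates to the NUMA node of the first thread that touches it.

        /* Hypothetical Next-Touch usage; MADV_NEXTTOUCH is an assumed
         * name, not a mainline Linux advice value. */
        #include <sys/mman.h>
        #include <stddef.h>

        void redistribute_on_next_touch(void *buf, size_t len)
        {
        #ifdef MADV_NEXTTOUCH
            /* Mark the buffer: the kernel invalidates the mappings so the
             * next access faults, and the fault handler re-allocates the
             * page on the node of the touching thread (Copy-on-Touch). */
            madvise(buf, len, MADV_NEXTTOUCH);
        #endif
            /* Worker threads then simply access their part of the buffer;
             * pages migrate lazily toward them, with no move_pages call. */
        }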

    Finding a Tradeoff between Host Interrupt Load and MPI Latency over Ethernet

    Achieving high-performance message passing on top of generic Ethernet hardware suffers from the NIC's interrupt-driven model, where coalescing is usually involved. We present an in-depth study of the impact of interrupt coalescing on Open-MX performance. It shows that disabling coalescing is only relevant for small-message latency, not for most other metrics. Two new coalescing strategies are then presented to efficiently support both latency-friendly and coalescing-friendly workloads, by having the NIC look at Open-MX messages and streams before deciding when to raise interrupts. The implementation of these strategies in the firmware of Myri-10G NICs shows that Open-MX is now able to achieve low small-message latency, high large-message throughput, and a satisfying message rate without having to manually tune the coalescing delay for each benchmark. Real application evaluation further shows that our modifications improve the NAS Parallel Benchmark IS execution time by 7-8%, thanks to our NIC firmware raising up to 20% additional interrupts at the right time.
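
    The strategies themselves live in NIC firmware; the C sketch below only illustrates the kind of decision they make, raising an interrupt immediately at small-message boundaries while letting a coalescing timer cover the middle of large streams. All names and thresholds here are invented for the example.

        /* Illustrative message-aware coalescing decision (not the actual
         * Myri-10G firmware code); names and thresholds are invented. */
        #include <stdbool.h>
        #include <stdint.h>

        #define SMALL_MSG_MAX 4096  /* assumed latency-sensitive size */

        struct rx_event {
            uint32_t msg_len;   /* total length of the message */
            bool     last_frag; /* this event completes the message */
        };

        /* Raise an interrupt right away for completed small messages, so
         * their latency is not delayed by the coalescing timer; defer for
         * fragments in the middle of large messages, where throughput and
         * host interrupt load matter more than per-event latency. */
        static bool should_raise_interrupt(const struct rx_event *ev)
        {
            if (ev->last_frag && ev->msg_len <= SMALL_MSG_MAX)
                return true;   /* latency-friendly path */
            return false;      /* let the coalescing delay expire */
        }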

    Optimisation Mechanisms for MPICH/Madeleine

    This report presents optimisation mechanisms within MPICH/Madeleine, the implementation of MPICH over Madeleine. These mechanisms aim to decrease the communication time of derived datatypes, whose data is stored in noncontiguous memory areas. The report presents the mechanisms as well as a performance evaluation.
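
    For context, the derived datatypes in question describe noncontiguous layouts such as matrix columns. A standard MPI example (independent of MPICH/Madeleine) is sketched below; it assumes MPI_Init has already been called.

        /* Standard MPI derived-datatype example: sending one column of a
         * row-major matrix as a single noncontiguous message. */
        #include <mpi.h>

        #define ROWS 4
        #define COLS 8

        void send_column(double matrix[ROWS][COLS], int col, int dest)
        {
            MPI_Datatype column;
            /* ROWS blocks of 1 double, separated by a stride of COLS. */
            MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
            MPI_Type_commit(&column);

            MPI_Send(&matrix[0][col], 1, column, dest, 0, MPI_COMM_WORLD);

            MPI_Type_free(&column);
        }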

    MPICH/Madeleine Installer's, User's and Developer's Guide

    MPICH/Madeleine is a new free implementation of the MPI standard based on the MPICH implementation and the multi-protocol communication library called Madeleine. It aims to efficiently exploit clusters of clusters with heterogeneous networks. This manual presents an installer's, user's and developer's guide for MPICH/Madeleine. The latest version of this document is available from the following URL: http://runtime.futurs.inria.fr/mpi/manual/

    Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite

    The recent introduction of task dependencies in the OpenMP specification provides new ways of synchronizing tasks. Application programmers can now describe the data a task will read as input and write as output, letting the runtime system resolve fine-grain dependencies between tasks to decide which task should execute next. Such an approach should scale better than the excessive global synchronization found in most OpenMP 3.0 applications. As promising as it looks, however, any new feature needs proper evaluation to encourage application programmers to embrace it. This paper introduces the KASTORS benchmark suite, designed to evaluate OpenMP task dependencies. We modified state-of-the-art OpenMP 3.0 benchmarks and data-flow parallel linear algebra kernels to make use of task dependencies. Learning from this experience, we propose extensions to the current OpenMP specification to improve the expressiveness of dependencies. We finally evaluate both the GCC/libGOMP and the CLANG/libIOMP implementations of OpenMP 4.0 on our KASTORS suite, demonstrating the interest of task dependencies compared to taskwait-based approaches.
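
    OpenMP 4.0 task dependencies are expressed with depend clauses; a minimal standard example (not code taken from the KASTORS suite) is shown below.

        /* Minimal OpenMP 4.0 task-dependency example (standard clauses,
         * not code from KASTORS). Compile with -fopenmp. */
        #include <stdio.h>

        int main(void)
        {
            int x = 0, y = 0;

            #pragma omp parallel
            #pragma omp single
            {
                #pragma omp task depend(out: x)   /* producer of x */
                x = 1;

                #pragma omp task depend(in: x) depend(out: y)
                y = x + 1;                        /* runs after the first task */

                #pragma omp task depend(in: y)    /* runs last, no taskwait */
                printf("y = %d\n", y);
            }
            return 0;
        }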

    NewMadeleine: a Fast Communication Scheduling Engine for High Performance Networks

    Communication libraries have made dramatic progress over the last fifteen years, pushed by the success of cluster architectures as the preferred platform for high-performance distributed computing. However, many potential optimizations are left unexplored in the process of mapping application communication requests onto low-level network commands. The fundamental cause of this situation is that the design of communication subsystems is mostly focused on reducing latency by shortening the critical path. In this paper, we present a new communication scheduling engine which dynamically optimizes application requests according to the NICs' capabilities and activity. The optimizing code is generic and portable, and the database of optimizing strategies may be dynamically extended.
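
    One classic optimization such an engine can apply is aggregating several pending small requests into a single network command. The sketch below is a generic illustration of that idea only; it is not NewMadeleine's actual API, and all names are invented.

        /* Generic illustration of request aggregation (not the actual
         * NewMadeleine API): pending small sends to the same destination
         * are packed into one network command while the NIC is busy. */
        #include <stddef.h>
        #include <string.h>

        #define PACK_MAX 8192

        struct send_req { const void *data; size_t len; };

        /* Pack as many queued requests as fit into one buffer; returns
         * the number of requests consumed. A scheduler would call this
         * while the NIC is still busy with a previous command, so the
         * aggregation costs no extra latency on the critical path. */
        size_t pack_requests(struct send_req *queue, size_t nreq,
                             char *pkt, size_t *pkt_len)
        {
            size_t used = 0, i;
            for (i = 0; i < nreq && used + queue[i].len <= PACK_MAX; i++) {
                memcpy(pkt + used, queue[i].data, queue[i].len);
                used += queue[i].len;
            }
            *pkt_len = used;
            return i;
        }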

    Dynamic Task and Data Placement over NUMA Architectures: an OpenMP Runtime Perspective

    Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid memory access penalties. Directive-based programming languages such as OpenMP provide programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into "scheduling hints" to solve thread/memory affinity issues. It enables dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. First experiments show that mixed solutions (migrating both threads and data) outperform Next-Touch-based data distribution policies and open possibilities for new optimizations.
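
    Such NUMA-aware placement relies on knowing the hardware topology. The sketch below uses the hwloc library (2.x API) to bind the calling thread to the cores of a given NUMA node; this is a plausible building block, not necessarily how the runtime described above does it internally.

        /* Binding a thread near its data with hwloc (2.x API). */
        #include <hwloc.h>

        int bind_thread_to_numa_node(int node_index)
        {
            hwloc_topology_t topo;
            hwloc_topology_init(&topo);
            hwloc_topology_load(topo);

            hwloc_obj_t node =
                hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, node_index);
            int err = -1;
            if (node)
                /* Restrict the calling thread to the cores of this node. */
                err = hwloc_set_cpubind(topo, node->cpuset,
                                        HWLOC_CPUBIND_THREAD);

            hwloc_topology_destroy(topo);
            return err;
        }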

    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators

    GPUs have largely entered HPC clusters, as shown by the top entries of the latest Top500 list. Exploiting such machines is however very challenging, not only because two separate paradigms, MPI and CUDA or OpenCL, must be combined, but also because nodes are heterogeneous and thus require careful load balancing within the nodes themselves. Current paradigms are usually limited to offloading only parts of the computation while leaving CPUs idle, or they require static work partitioning between CPUs and GPUs. To handle single-node architecture heterogeneity, we previously proposed StarPU, a runtime system capable of dynamically scheduling tasks in an optimized way on such machines. We show here how the task paradigm of StarPU has been combined with MPI communications, and how we extended the task paradigm itself to allow mapping the task graph on MPI clusters so as to automatically achieve an optimized distributed execution. We show how a sequential-like Cholesky source code can be easily extended into a scalable distributed parallel execution, already exhibiting a speedup of 5 on 6 nodes.
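
    The programming style the abstract describes looks roughly like the following sketch, based on the public starpu_mpi_task_insert interface; exact signatures vary across StarPU versions, and the codelet and data registration details are simplified placeholders.

        /* Sketch of the StarPU-MPI style (illustration only; exact
         * signatures vary across StarPU versions). */
        #include <stdint.h>
        #include <starpu.h>
        #include <starpu_mpi.h>

        static void scal(void *buffers[], void *arg)
        {
            (void)arg;
            double *v = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
            unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
            for (unsigned i = 0; i < n; i++) v[i] *= 2.0;
        }

        static struct starpu_codelet cl = {
            .cpu_funcs = { scal }, .nbuffers = 1, .modes = { STARPU_RW },
        };

        int main(int argc, char **argv)
        {
            double vec[1024] = { 0 };
            starpu_data_handle_t h;

            starpu_init(NULL);
            starpu_mpi_init(&argc, &argv, 1);

            starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                        (uintptr_t)vec, 1024, sizeof(double));
            /* Give the handle an MPI tag and an owner rank; StarPU-MPI
             * then infers the needed transfers from the task graph. */
            starpu_mpi_data_register(h, /* tag */ 42, /* owner */ 0);

            /* Same call on every rank; data moves only as required. */
            starpu_mpi_task_insert(MPI_COMM_WORLD, &cl, STARPU_RW, h, 0);

            starpu_task_wait_for_all();
            starpu_data_unregister(h);
            starpu_mpi_shutdown();
            starpu_shutdown();
            return 0;
        }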

    StarPU-MPI: Task Programming over Clusters of Machines Enhanced with Accelerators

    GPU clusters are becoming widespread HPC platforms. Exploiting them is however challenging, as this requires two separate paradigms (MPI and CUDA or OpenCL) and careful load balancing due to node heterogeneity. Current paradigms usually either limit themselves to offloading part of the computation and leave CPUs idle, or require static CPU/GPU work partitioning. We thus previously proposed StarPU, a runtime system able to dynamically schedule tasks within a single heterogeneous node. We show how we extended the task paradigm of StarPU with MPI to easily map the task graph onto MPI clusters and automatically benefit from optimized execution.